Using Multiple Discriminant Analysis Approach for Linear Text Segmentation

Authors

  • Jingbo Zhu
  • Na Ye
  • Xingzhi Chang
  • Wenliang Chen
  • Benjamin Ka-Yin T'sou
Abstract

Linear text segmentation has been an ongoing research focus in NLP for the last decade, with great potential for applications such as document summarization, information retrieval, and text understanding. Two critical problems remain, however: automatic boundary detection and automatic determination of the number of segments in a document. In this paper, we propose a new domain-independent statistical model for linear text segmentation. The model uses a Multiple Discriminant Analysis (MDA) criterion function to achieve global optimization, finding the best segmentation as the one with the largest word similarity within segments and the smallest word similarity between segments. To alleviate the high computational complexity that the model introduces, genetic algorithms (GAs) are used. Comparative experimental results show that our method based on MDA criterion functions achieves a higher Pk measure (Beeferman) than the baseline system using the TextTiling algorithm.
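For readers unfamiliar with the Pk measure named in the abstract, the following is a minimal sketch of how it is commonly computed. It is not taken from the paper. The function names `labels_from_lengths` and `pk`, the representation of a segmentation as a list of segment lengths, and the default window size (half the average reference segment length) are assumptions made for illustration. Under this standard definition Pk is an error rate, with 0 meaning perfect agreement with the reference.

```python
from typing import List, Optional


def labels_from_lengths(lengths: List[int]) -> List[int]:
    """Expand segment lengths, e.g. [2, 3], into per-unit segment ids [0, 0, 1, 1, 1]."""
    labels: List[int] = []
    for seg_id, n in enumerate(lengths):
        labels.extend([seg_id] * n)
    return labels


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of width-k probes on which reference and hypothesis disagree.

    Both arguments are lists of segment lengths (hypothetical input format)
    covering the same number of text units (sentences or words).
    """
    ref = labels_from_lengths(reference)
    hyp = labels_from_lengths(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    if k is None:
        # Common convention: half the average reference segment length.
        k = max(1, round(len(ref) / len(reference) / 2))
    n_probes = len(ref) - k
    if n_probes <= 0:
        raise ValueError("document is too short for the chosen window size")
    errors = 0
    for i in range(n_probes):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        # Count a probe as an error when the two segmentations disagree
        # about whether units i and i + k belong to the same segment.
        errors += same_in_ref != same_in_hyp
    return errors / n_probes


if __name__ == "__main__":
    print(pk([3, 3], [3, 3], k=2))  # identical segmentations
    print(pk([3, 3], [6], k=2))     # hypothesis misses the only boundary
```

The window size `k` is passed explicitly in the usage lines so that the result does not depend on how the default is rounded.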


Similar resources

Experiments in Unconstrained Offline Handwritten Text Recognition

A system for off-line handwritten text recognition is presented. It is characterized by a segmentation-free approach, i.e. whole lines of text are processed by the recognition module. The methods used for pre-processing, feature extraction, and statistical modelling are described, and several experiments on writer-independent, multiple writer, and single writer handwriting recognition tasks are...


Recursive Algorithms for Image Segmentation Based on a Discriminant Criterion

In this study, a new criterion for determining the number of classes into which an image should be segmented is proposed. This criterion is based on discriminant analysis for measuring the separability among the segmented classes of pixels. Based on the new discriminant criterion, two algorithms for recursively segmenting the image into a determined number of classes are proposed. The proposed methods can a...


A Robust and Efficient Motion Segmentation Based on Orthogonal Projection Matrix of Shape Space

A novel algorithm for motion segmentation is proposed. The algorithm uses the fact that the shape of an object with homogeneous motion is represented as a 4-dimensional linear space. Thus motion segmentation is done by decomposing the shape space of multiple objects into a set of 4-dimensional subspaces. The decomposition is realized using the discriminant analysis of orthogonal projection matrix...


Text Segmentation with Topic Modeling and Entity Coherence

This paper describes a system which uses entity and topic coherence for improved Text Segmentation (TS) accuracy. First, the Latent Dirichlet Allocation (LDA) algorithm was used to obtain topics for sentences in the document. We then performed entity mapping across a window in order to discover the transition of entities within sentences. We used the information obtained to support our LDA-based bo...


A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling

In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...



Publication date: 2005